In this document we use the Diabetes Health Indicator Dataset to predict whether a person has diabetes based on their health condition. First we load all the required libraries in the section below; we will answer the questions afterwards.
library(ggplot2)
library(vioplot)
## Warning: package 'vioplot' was built under R version 4.3.1
## Loading required package: sm
## Warning: package 'sm' was built under R version 4.3.1
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.1
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.3.1
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.1
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(psych)
## Warning: package 'psych' was built under R version 4.3.1
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(caTools)
## Warning: package 'caTools' was built under R version 4.3.1
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.3.1
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.1
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:ggplot2':
##
## margin
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.3.1
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.1
Now we load the data.
data1 <- read.csv("./datasets/diabetes_binary_health_indicators_BRFSS2015.csv")
data2 <- read.csv("./datasets/diabetes_binary_5050split_health_indicators_BRFSS2015.csv")
data3 <- read.csv("./datasets/diabetes_012_health_indicators_BRFSS2015.csv")
Throughout this document we will check the performance of our models on all three datasets. Now let’s take a look at the data!
print(paste("dimensions of the 1st dataset:", paste(dim(data1), collapse = " x ")))
## [1] "dimensions of the 1st dataset: 253680 x 22"
print("first rows of the 1st dataset")
## [1] "first rows of the 1st dataset"
head(data1)
## Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke
## 1 0 1 1 1 40 1 0
## 2 0 0 0 0 25 1 0
## 3 0 1 1 1 28 0 0
## 4 0 1 0 1 27 0 0
## 5 0 1 1 1 24 0 0
## 6 0 1 1 1 25 1 0
## HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 1 0 0 0 1 0
## 2 0 1 0 0 0
## 3 0 0 1 0 0
## 4 0 1 1 1 0
## 5 0 1 1 1 0
## 6 0 1 1 1 0
## AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age
## 1 1 0 5 18 15 1 0 9
## 2 0 1 3 0 0 0 0 7
## 3 1 1 5 30 30 1 0 9
## 4 1 0 2 0 0 0 0 11
## 5 1 0 2 3 0 0 0 11
## 6 1 0 2 0 2 0 1 10
## Education Income
## 1 4 3
## 2 6 1
## 3 4 8
## 4 3 6
## 5 5 4
## 6 6 8
print("number of diabetes/non-diabetes occurrences")
## [1] "number of diabetes/non-diabetes occurrences"
table(data1$Diabetes_binary)
##
## 0 1
## 218334 35346
print(paste("dimensions of the 2nd dataset:", paste(dim(data2), collapse = " x ")))
## [1] "dimensions of the 2nd dataset: 70692 x 22"
print("first rows of the 2nd dataset")
## [1] "first rows of the 2nd dataset"
head(data2)
## Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke
## 1 0 1 0 1 26 0 0
## 2 0 1 1 1 26 1 1
## 3 0 0 0 1 26 0 0
## 4 0 1 1 1 28 1 0
## 5 0 0 0 1 29 1 0
## 6 0 0 0 1 18 0 0
## HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 1 0 1 0 1 0
## 2 0 0 1 0 0
## 3 0 1 1 1 0
## 4 0 1 1 1 0
## 5 0 1 1 1 0
## 6 0 1 1 1 0
## AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age
## 1 1 0 3 5 30 0 1 4
## 2 1 0 3 0 0 0 1 12
## 3 1 0 1 0 10 0 1 13
## 4 1 0 3 0 3 0 1 11
## 5 1 0 2 0 0 0 0 8
## 6 0 0 2 7 0 0 0 1
## Education Income
## 1 6 8
## 2 6 8
## 3 6 8
## 4 6 8
## 5 5 8
## 6 4 7
print("number of diabetes/non-diabetes occurrences")
## [1] "number of diabetes/non-diabetes occurrences"
table(data2$Diabetes_binary)
##
## 0 1
## 35346 35346
print(paste("dimensions of the 3rd dataset:", paste(dim(data3), collapse = " x ")))
## [1] "dimensions of the 3rd dataset: 253680 x 22"
print("first rows of the 3rd dataset")
## [1] "first rows of the 3rd dataset"
head(data3)
## Diabetes_012 HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack
## 1 0 1 1 1 40 1 0 0
## 2 0 0 0 0 25 1 0 0
## 3 0 1 1 1 28 0 0 0
## 4 0 1 0 1 27 0 0 0
## 5 0 1 1 1 24 0 0 0
## 6 0 1 1 1 25 1 0 0
## PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost
## 1 0 0 1 0 1 0
## 2 1 0 0 0 0 1
## 3 0 1 0 0 1 1
## 4 1 1 1 0 1 0
## 5 1 1 1 0 1 0
## 6 1 1 1 0 1 0
## GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
## 1 5 18 15 1 0 9 4 3
## 2 3 0 0 0 0 7 6 1
## 3 5 30 30 1 0 9 4 8
## 4 2 0 0 0 0 11 3 6
## 5 2 3 0 0 0 11 5 4
## 6 2 0 2 0 1 10 6 8
print("number of occurrences of each Diabetes_012 class")
## [1] "number of occurrences of each Diabetes_012 class"
# the response column of data3 is Diabetes_012, not Diabetes_binary
table(data3$Diabetes_012)
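The class counts above can also be read as percentages, which makes the imbalance of the first and third datasets easier to compare with the balanced second one. The following is only a minimal sketch on a small made-up data frame; `class_balance` is a hypothetical helper that is not part of the original analysis.

```r
# Hypothetical helper: class counts and percentages for one response column.
class_balance <- function(df, target) {
  counts <- table(df[[target]])
  data.frame(
    class   = names(counts),
    n       = as.vector(counts),
    percent = round(100 * as.vector(counts) / sum(counts), 2)
  )
}

# Toy stand-in for data1 (illustrative values only, not the real data).
toy <- data.frame(Diabetes_binary = c(0, 0, 0, 1, 0, 1, 0, 0))
balance <- class_balance(toy, "Diabetes_binary")
print(balance)
```

For the third dataset the same helper would be called with `"Diabetes_012"` as the target.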
Now we check whether there are any duplicate rows.
duplicates1 <- sum(duplicated(data1))
p1 <- duplicates1 / nrow(data1)
duplicates2 <- sum(duplicated(data2))
p2 <- duplicates2 / nrow(data2)
duplicates3 <- sum(duplicated(data3))
p3 <- duplicates3 / nrow(data3)
print(paste("Number of duplicate rows in the first dataset: ", duplicates1, "which is ", p1*100,"%"))
## [1] "Number of duplicate rows in the first dataset: 24206 which is 9.54194260485651 %"
print(paste("Number of duplicate rows in the second dataset: ", duplicates2, "which is ", p2*100,"%"))
## [1] "Number of duplicate rows in the second dataset: 1635 which is 2.3128501103378 %"
print(paste("Number of duplicate rows in the third dataset: ", duplicates3, "which is ", p3*100,"%"))
## [1] "Number of duplicate rows in the third dataset: 23899 which is 9.42092399873857 %"
We can see that a small but noticeable portion of the data is duplicated. However, this is not a problem: the data was gathered by asking people about their conditions, and most answers are binary or coarse ordinal values, so different respondents giving identical answers is completely natural.
Now we check whether the data contains any NA values.
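To illustrate why duplicates arise naturally in survey data with few possible answers, here is a small simulation. It is only a sketch on simulated respondents (the number of questions and the answer probabilities are assumptions), not a statement about the real datasets.

```r
set.seed(1)
n <- 1000

# Simulated, independent respondents answering five yes/no questions.
toy <- as.data.frame(matrix(rbinom(n * 5, 1, 0.5), ncol = 5))

# duplicated() flags every row that repeats an earlier row.
n_dup <- sum(duplicated(toy))

# Only 2^5 = 32 distinct answer patterns exist, so at least n - 32 rows repeat.
print(n_dup)

# If duplicates ever had to be dropped, this keeps the first copy of each row.
toy_unique <- toy[!duplicated(toy), ]
```

Even though every simulated respondent is independent, almost all rows are flagged as duplicates, simply because the answer space is small.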
null1 <- sum(is.na(data1))
null2 <- sum(is.na(data2))
null3 <- sum(is.na(data3))
print(paste("Number of NA values in the first dataset:", null1))
## [1] "Number of NA values in the first dataset: 0"
print(paste("Number of NA values in the second dataset:", null2))
## [1] "Number of NA values in the second dataset: 0"
print(paste("Number of NA values in the third dataset:", null3))
## [1] "Number of NA values in the third dataset: 0"
As we can see, there are no NA values, so we can continue without worrying about missing data.
Now let’s see the number of distinct values in each column of our data.
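Note that `sum(is.na(df))` counts missing cells over the whole data frame. When missing values do exist, it is often more useful to see them per column or per row. A minimal sketch on a made-up data frame (the values are illustrative only):

```r
# Toy data frame with two missing cells (illustrative values only).
toy <- data.frame(
  BMI = c(25, NA, 31),
  Age = c(9, 7, NA),
  Sex = c(0, 1, 1)
)

na_cells   <- sum(is.na(toy))            # total number of missing cells
na_per_col <- colSums(is.na(toy))        # missing cells in each column
na_rows    <- sum(!complete.cases(toy))  # rows with at least one NA

print(na_per_col)
```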
distinct_values <- function(data_frame) {
  for (col in colnames(data_frame)) {
    distinct_count <- length(unique(data_frame[[col]]))
    print(paste("Column:", col, "number of distinctvalues:",
                distinct_count))
  }
}
print("-Dataset1-")
## [1] "-Dataset1-"
distinct_values(data1)
## [1] "Column: Diabetes_binary number of distinctvalues: 2"
## [1] "Column: HighBP number of distinctvalues: 2"
## [1] "Column: HighChol number of distinctvalues: 2"
## [1] "Column: CholCheck number of distinctvalues: 2"
## [1] "Column: BMI number of distinctvalues: 84"
## [1] "Column: Smoker number of distinctvalues: 2"
## [1] "Column: Stroke number of distinctvalues: 2"
## [1] "Column: HeartDiseaseorAttack number of distinctvalues: 2"
## [1] "Column: PhysActivity number of distinctvalues: 2"
## [1] "Column: Fruits number of distinctvalues: 2"
## [1] "Column: Veggies number of distinctvalues: 2"
## [1] "Column: HvyAlcoholConsump number of distinctvalues: 2"
## [1] "Column: AnyHealthcare number of distinctvalues: 2"
## [1] "Column: NoDocbcCost number of distinctvalues: 2"
## [1] "Column: GenHlth number of distinctvalues: 5"
## [1] "Column: MentHlth number of distinctvalues: 31"
## [1] "Column: PhysHlth number of distinctvalues: 31"
## [1] "Column: DiffWalk number of distinctvalues: 2"
## [1] "Column: Sex number of distinctvalues: 2"
## [1] "Column: Age number of distinctvalues: 13"
## [1] "Column: Education number of distinctvalues: 6"
## [1] "Column: Income number of distinctvalues: 8"
print("-Dataset2-")
## [1] "-Dataset2-"
distinct_values(data2)
## [1] "Column: Diabetes_binary number of distinctvalues: 2"
## [1] "Column: HighBP number of distinctvalues: 2"
## [1] "Column: HighChol number of distinctvalues: 2"
## [1] "Column: CholCheck number of distinctvalues: 2"
## [1] "Column: BMI number of distinctvalues: 80"
## [1] "Column: Smoker number of distinctvalues: 2"
## [1] "Column: Stroke number of distinctvalues: 2"
## [1] "Column: HeartDiseaseorAttack number of distinctvalues: 2"
## [1] "Column: PhysActivity number of distinctvalues: 2"
## [1] "Column: Fruits number of distinctvalues: 2"
## [1] "Column: Veggies number of distinctvalues: 2"
## [1] "Column: HvyAlcoholConsump number of distinctvalues: 2"
## [1] "Column: AnyHealthcare number of distinctvalues: 2"
## [1] "Column: NoDocbcCost number of distinctvalues: 2"
## [1] "Column: GenHlth number of distinctvalues: 5"
## [1] "Column: MentHlth number of distinctvalues: 31"
## [1] "Column: PhysHlth number of distinctvalues: 31"
## [1] "Column: DiffWalk number of distinctvalues: 2"
## [1] "Column: Sex number of distinctvalues: 2"
## [1] "Column: Age number of distinctvalues: 13"
## [1] "Column: Education number of distinctvalues: 6"
## [1] "Column: Income number of distinctvalues: 8"
print("-Dataset3-")
## [1] "-Dataset3-"
distinct_values(data3)
## [1] "Column: Diabetes_012 number of distinctvalues: 3"
## [1] "Column: HighBP number of distinctvalues: 2"
## [1] "Column: HighChol number of distinctvalues: 2"
## [1] "Column: CholCheck number of distinctvalues: 2"
## [1] "Column: BMI number of distinctvalues: 84"
## [1] "Column: Smoker number of distinctvalues: 2"
## [1] "Column: Stroke number of distinctvalues: 2"
## [1] "Column: HeartDiseaseorAttack number of distinctvalues: 2"
## [1] "Column: PhysActivity number of distinctvalues: 2"
## [1] "Column: Fruits number of distinctvalues: 2"
## [1] "Column: Veggies number of distinctvalues: 2"
## [1] "Column: HvyAlcoholConsump number of distinctvalues: 2"
## [1] "Column: AnyHealthcare number of distinctvalues: 2"
## [1] "Column: NoDocbcCost number of distinctvalues: 2"
## [1] "Column: GenHlth number of distinctvalues: 5"
## [1] "Column: MentHlth number of distinctvalues: 31"
## [1] "Column: PhysHlth number of distinctvalues: 31"
## [1] "Column: DiffWalk number of distinctvalues: 2"
## [1] "Column: Sex number of distinctvalues: 2"
## [1] "Column: Age number of distinctvalues: 13"
## [1] "Column: Education number of distinctvalues: 6"
## [1] "Column: Income number of distinctvalues: 8"
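The same per-column counts can be collected into a single named vector, which is easier to compare across datasets than printed lines. This is a compact alternative sketched on a made-up data frame, not the code used above:

```r
# Toy data frame (illustrative values only).
toy <- data.frame(
  HighBP  = c(0, 1, 1, 0),
  BMI     = c(24, 28, 31, 24),
  GenHlth = c(1, 3, 5, 2)
)

# One distinct-value count per column, returned as a named integer vector.
distinct_counts <- sapply(toy, function(x) length(unique(x)))
print(distinct_counts)
```

Columns with only two distinct values are binary indicators, while columns such as BMI behave more like continuous variables.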
As the final step of our pre-processing stage, we take a look at the summary of each dataset.
summary(data1)
## Diabetes_binary HighBP HighChol CholCheck
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :1.0000
## Mean :0.1393 Mean :0.429 Mean :0.4241 Mean :0.9627
## 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :12.00 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:24.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :27.00 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :28.38 Mean :0.4432 Mean :0.04057 Mean :0.09419
## 3rd Qu.:31.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :98.00 Max. :1.0000 Max. :1.00000 Max. :1.00000
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.7565 Mean :0.6343 Mean :0.8114 Mean :0.0562
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000
## Median :1.0000 Median :0.00000 Median :2.000 Median : 0.000
## Mean :0.9511 Mean :0.08418 Mean :2.511 Mean : 3.185
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:3.000 3rd Qu.: 2.000
## Max. :1.0000 Max. :1.00000 Max. :5.000 Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 6.000
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 8.000
## Mean : 4.242 Mean :0.1682 Mean :0.4403 Mean : 8.032
## 3rd Qu.: 3.000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:10.000
## Max. :30.000 Max. :1.0000 Max. :1.0000 Max. :13.000
## Education Income
## Min. :1.00 Min. :1.000
## 1st Qu.:4.00 1st Qu.:5.000
## Median :5.00 Median :7.000
## Mean :5.05 Mean :6.054
## 3rd Qu.:6.00 3rd Qu.:8.000
## Max. :6.00 Max. :8.000
summary(data2)
## Diabetes_binary HighBP HighChol CholCheck
## Min. :0.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.5 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :0.5 Mean :0.5635 Mean :0.5257 Mean :0.9753
## 3rd Qu.:1.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :12.00 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:25.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :29.00 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :29.86 Mean :0.4753 Mean :0.06217 Mean :0.1478
## 3rd Qu.:33.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :98.00 Max. :1.0000 Max. :1.00000 Max. :1.0000
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.00000
## Median :1.000 Median :1.0000 Median :1.0000 Median :0.00000
## Mean :0.703 Mean :0.6118 Mean :0.7888 Mean :0.04272
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Min. :0.000 Min. :0.00000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000
## Median :1.000 Median :0.00000 Median :3.000 Median : 0.000
## Mean :0.955 Mean :0.09391 Mean :2.837 Mean : 3.752
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:4.000 3rd Qu.: 2.000
## Max. :1.000 Max. :1.00000 Max. :5.000 Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.00 Min. :0.0000 Min. :0.000 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.: 7.000
## Median : 0.00 Median :0.0000 Median :0.000 Median : 9.000
## Mean : 5.81 Mean :0.2527 Mean :0.457 Mean : 8.584
## 3rd Qu.: 6.00 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:11.000
## Max. :30.00 Max. :1.0000 Max. :1.000 Max. :13.000
## Education Income
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:4.000
## Median :5.000 Median :6.000
## Mean :4.921 Mean :5.698
## 3rd Qu.:6.000 3rd Qu.:8.000
## Max. :6.000 Max. :8.000
summary(data3)
## Diabetes_012 HighBP HighChol CholCheck
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :1.0000
## Mean :0.2969 Mean :0.429 Mean :0.4241 Mean :0.9627
## 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :2.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :12.00 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:24.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :27.00 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :28.38 Mean :0.4432 Mean :0.04057 Mean :0.09419
## 3rd Qu.:31.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :98.00 Max. :1.0000 Max. :1.00000 Max. :1.00000
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.7565 Mean :0.6343 Mean :0.8114 Mean :0.0562
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000
## Median :1.0000 Median :0.00000 Median :2.000 Median : 0.000
## Mean :0.9511 Mean :0.08418 Mean :2.511 Mean : 3.185
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:3.000 3rd Qu.: 2.000
## Max. :1.0000 Max. :1.00000 Max. :5.000 Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 6.000
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 8.000
## Mean : 4.242 Mean :0.1682 Mean :0.4403 Mean : 8.032
## 3rd Qu.: 3.000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:10.000
## Max. :30.000 Max. :1.0000 Max. :1.0000 Max. :13.000
## Education Income
## Min. :1.00 Min. :1.000
## 1st Qu.:4.00 1st Qu.:5.000
## Median :5.00 Median :7.000
## Mean :5.05 Mean :6.054
## 3rd Qu.:6.00 3rd Qu.:8.000
## Max. :6.00 Max. :8.000
Before we go further, we make a small change to our data frames: we convert the response columns to factors. This is required because the plots below group the data by response class.
data1$Diabetes_binary <- as.factor(data1$Diabetes_binary)
data2$Diabetes_binary <- as.factor(data2$Diabetes_binary)
data3$Diabetes_012 <- as.factor(data3$Diabetes_012)
Now we take a deeper look at our data and do some exploratory analysis.
First let’s look at the distribution of each variable. We use violin plots for this purpose.
par(mfrow = c(1, 3))
for (col_ in colnames(data1)) {
  if (col_ == "Diabetes_binary") {
    next
  }
  vioplot(data1[[col_]], col = "yellow", border = "black",
          horizontal = FALSE, xlab = "dataset1", ylab = "Values",
          main = col_)
  vioplot(data2[[col_]], col = "orange", border = "black",
          horizontal = FALSE, xlab = "dataset2", ylab = "Values",
          main = col_)
  vioplot(data3[[col_]], col = "red", border = "black",
          horizontal = FALSE, xlab = "dataset3", ylab = "Values",
          main = col_)
}
Now that we have some idea about the distribution of each variable, let’s compare the distribution of each variable separately for each response class. We do this to see whether there is a meaningful difference between the classes, and this time we use KDE plots.
# columns 2 to 22 are the predictors; their names are identical in all 3 datasets
for (col_ in colnames(data1)[2:22]) {
  # .data[[col_]] looks the column up by name inside aes()
  plot1 <- ggplot(data1, aes(x = .data[[col_]], fill = Diabetes_binary)) +
    geom_density(alpha = 0.4) +
    ggtitle("from data1") +
    labs(x = col_)
  plot2 <- ggplot(data2, aes(x = .data[[col_]], fill = Diabetes_binary)) +
    geom_density(alpha = 0.4) +
    ggtitle("from data2") +
    labs(x = col_)
  plot3 <- ggplot(data3, aes(x = .data[[col_]], fill = Diabetes_012)) +
    geom_density(alpha = 0.4) +
    ggtitle("from data3") +
    labs(x = col_)
  print(plot1)
  print(plot2)
  print(plot3)
}
As we can see, there are some obvious and meaningful differences in the distribution for some of the parameters. However, this is only a visual impression; we will perform a feature importance analysis later in order to understand which parameters are actually important.
As the final step of our EDA, we do a multivariate analysis in order to see the correlations between the parameters.
# undo the factor conversion that we made for plotting
data1$Diabetes_binary <- as.numeric(data1$Diabetes_binary) - 1
data2$Diabetes_binary <- as.numeric(data2$Diabetes_binary) - 1
data3$Diabetes_012 <- as.numeric(data3$Diabetes_012) - 1
# these correlation plots are not very readable because of their size,
# so I will include images of them for a better view
corPlot(data1, cex = 0.5)
corPlot(data2, cex = 0.5)
corPlot(data3, cex = 0.5)
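A note on the conversion back to numeric used above: `as.numeric()` on a factor returns the internal level codes (1, 2, 3, ...), not the original labels. Subtracting 1 recovers the original values here only because the labels are exactly 0, 1 (and 2). The sketch below, on made-up vectors, shows the difference and a variant that works for any numeric labels:

```r
# Labels 0, 1, 2: level codes are 1, 2, 3, so "codes - 1" happens to be correct.
f <- factor(c(0, 2, 1, 2))
via_codes  <- as.numeric(f) - 1
via_labels <- as.numeric(as.character(f))

# Labels 10, 20, 30: "codes - 1" no longer gives back the original values.
g <- factor(c(10, 30, 20))
wrong <- as.numeric(g) - 1
right <- as.numeric(as.character(g))

print(wrong)
print(right)
```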
Now our EDA stage is also done. We have a good understanding of what our data is and we have some information about the different variables. Here we briefly mention some of the key points.
First we drew violin plots for all of our datasets. We can see that the distribution of each variable is almost the same across the datasets. Only two parameters (GenHlth, Age) show very minor differences in their distribution, which should not affect the model output much.
Then we have the KDE plots, which show the distribution of each variable separately for each value of the response variable. We did this to understand which variables have a meaningful difference in their distributions and can therefore be useful as model inputs. We can see different patterns here. If the curves for the different classes look alike, that variable does not affect the outcome much (for instance Education), and vice versa.
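The visual comparison of the KDE curves can be backed up with a simple number: the mean of each variable within each response class. The following is only a sketch on made-up values, chosen so that BMI differs between the classes and Education does not; it is not a result from the real datasets.

```r
# Toy data (illustrative values only).
toy <- data.frame(
  Diabetes_binary = c(0, 0, 0, 0, 1, 1, 1, 1),
  BMI             = c(22, 25, 27, 26, 31, 35, 29, 33),
  Education       = c(4, 6, 5, 5, 5, 4, 6, 5)
)

# Mean of every other column within each response class.
group_means <- aggregate(. ~ Diabetes_binary, data = toy, FUN = mean)
print(group_means)
```

A large gap between the class means corresponds to clearly separated curves in the KDE plot, while near-identical means correspond to overlapping curves.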
Afterwards we have the correlation plots. In these plots we can see that the variables HighBP, HighChol, BMI, GenHlth, DiffWalk, Age and HeartDiseaseorAttack have the strongest correlation with diabetes. We can also see that the rows of GenHlth, DiffWalk, PhysHlth, Age, Education and Income are the most colourful, meaning that these variables have a noticeable correlation with most of the other variables.
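Because the full correlation plot is hard to read, it can help to extract only the column of correlations with the response and sort it. Below is a minimal sketch on simulated data; the way the toy response is generated is an assumption made purely for illustration.

```r
set.seed(42)
n <- 500

# Simulated stand-in data: the response depends on HighBP and BMI, not on Fruits.
toy <- data.frame(
  HighBP = rbinom(n, 1, 0.4),
  BMI    = rnorm(n, 28, 6),
  Fruits = rbinom(n, 1, 0.6)
)
score <- 1.5 * toy$HighBP + 0.1 * (toy$BMI - 28) + rnorm(n)
toy$Diabetes_binary <- as.numeric(score > 1)

# Correlation of every predictor with the response, ranked by absolute value.
cors   <- cor(toy)[, "Diabetes_binary"]
cors   <- cors[names(cors) != "Diabetes_binary"]
ranked <- sort(abs(cors), decreasing = TRUE)
print(round(ranked, 3))
```

On the real data the same three lines would be applied to `data1`, `data2` or `data3` (with `"Diabetes_012"` as the response name for the third dataset).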
Finally we have the pair plots. I include the code that I used to produce them, but it is commented out because computing these plots is very time consuming and resource demanding (it took me around 3 hours to get all 3 of them!), so I show images of them instead.
# plot1 <- ggpairs( data1 )
# ggsave(filename = "/content/plot1.png", plot = plot1, width = 100, height = 50, limitsize = FALSE)
# print("done!")
# plot2 <- ggpairs( data2 )
# ggsave(filename = "/content/plot2.png", plot = plot2, width = 100, height = 50, limitsize = FALSE)
# print("done!")
# plot3 <- ggpairs( data3 )
# ggsave(filename = "/content/plot3.png", plot = plot3, width = 100, height = 50, limitsize = FALSE)
# print("done!")
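One way to make a pair plot cheaper is to draw it on a random sample of rows and a subset of columns instead of the full data. This is only a sketch on simulated data: the sample size of 1000 and the chosen columns are assumptions, and a sampled plot is an approximation of the full one.

```r
set.seed(123)

# Toy stand-in for data1 (simulated values, not the real data).
toy <- data.frame(
  Diabetes_binary = rbinom(5000, 1, 0.14),
  BMI             = round(rnorm(5000, 28, 6)),
  GenHlth         = sample(1:5, 5000, replace = TRUE),
  Age             = sample(1:13, 5000, replace = TRUE)
)

# Draw a random subset of rows and keep only the columns of interest.
keep_cols <- c("Diabetes_binary", "BMI", "GenHlth", "Age")
small <- toy[sample(nrow(toy), 1000), keep_cols]

# Build the pair plot only if GGally is installed.
if (requireNamespace("GGally", quietly = TRUE)) {
  pair_plot <- GGally::ggpairs(small)
}
```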
Here are the plots, in order:
knitr::include_graphics('./plots/plot1_1.png')
knitr::include_graphics('./plots/plot2_1.png')